Abstract

Propensity score analysis is one of the most widely used methods for studying the causal
treatment effect in observational studies. This paper studies treatment effect estimation
with the method of matching weights. This method resembles propensity score matching but
offers a number of new features, including efficient estimation, rigorous variance
calculation, simple asymptotics, statistical tests of balance, a clearly identified target
population with an optimal sampling property, and no need to choose a matching algorithm or
caliper size. In addition, we propose the mirror histogram as a useful tool for graphically
displaying balance. The method also shares some features of the inverse probability
weighting methods, but the computation remains stable when the propensity scores approach
0 or 1. An augmented version of the matching weight estimator is developed that has the
double robust property, i.e., the estimator is consistent if either the outcome model or the
propensity score model is correct. In the numerical studies, the proposed methods
demonstrated better performance than many widely used propensity score analysis methods
such as stratification by quintiles, matching with propensity scores, and inverse
probability weighting.

1 Introduction



Propensity score analysis is an important statistical tool for adjusting for confounding in
observational studies [IT], and has been widely used across many research fields such as
epidemiology, economics, and the political and social sciences [7] [TJ1 IH]. In this paper,
we study the population propensity score analysis [22]. Let $\{(Y_i, Z_i, X_i), i = 1, 2,
\ldots, n\}$ be the observed data from $n$ independent subjects randomly sampled from a
population of research interest, where $Y_i$ denotes the outcome of research interest,
$Z_i = 1$ or $0$ indicates whether the subject was assigned to the treatment or control, and
$X_i$ is a vector of variables related to the treatment assignment and the outcome. The
research question is whether the treatment has an effect on the outcome and, if so, how
large that effect is.

It is helpful to conceptualize this problem using the potential outcomes framework. We
assume that the observed outcome for subject $i$ would be $Y_{1i}$ if the subject had been
assigned to the treatment group and $Y_{0i}$ if assigned to the control group. Since a
subject can receive only the treatment or the control, $Y_{1i}$ and $Y_{0i}$ are potential
outcomes that are never observed simultaneously. Their relationship with the observed
outcome $Y_i$ and assignment $Z_i$ is assumed to be $Y_i = Y_{1i} Z_i + Y_{0i}(1 - Z_i)$.
This relationship is called the Stable Unit Treatment Value Assumption [20]. It implies that
the observed outcome of a subject is solely determined by the potential outcomes and the
assignment for that subject, and is not affected by data from other subjects. This
assumption is satisfied in the setting considered in this paper as we assume that the
subjects are independent. Another assumption needed for propensity score analysis is the
assumption of "no unmeasured confounders" or "strongly ignorable treatment assignment":
$(Y_{1i}, Y_{0i}) \perp Z_i \mid X_i$. It states that $X_i$ must include all relevant
variables such that, conditional on these observed variables, the potential outcomes are
independent of the treatment assignment. The notation $\perp$ denotes independence.
Variables in $X_i$ are called confounders, and this assumption requires that there be no
unmeasured confounders.

The goal of the propensity score analysis is to estimate the effect of the treatment, which
may be defined for each subject as $\Delta_i = E(Y_{1i} - Y_{0i})$, the difference in
expected potential outcomes of the same subject. If $\Delta_i = \Delta$, a typical setting
studied in many numerical studies [TH Hj, the treatment effect is homogeneous. If
$\Delta_i$ may be different for different subjects, the treatment effect is heterogeneous,
and one may study the average causal effect $\Delta = E(\Delta_i)$, where the expectation is
taken over some target population of research interest. We consider both cases in this
paper, and when the treatment effect is heterogeneous, we assume
$\Delta_i = E(Y_{1i} - Y_{0i} \mid X_i) = \Delta(X_i)$ for a function $\Delta(\cdot)$ of the
confounders.

The propensity score is defined for each subject to be the conditional probability of
receiving the treatment given the confounders, i.e., $e_i = \Pr(Z_i = 1 \mid X_i)$.
Rosenbaum and Rubin [17] proved $X_i \perp Z_i \mid e_i$ and
$(Y_{1i}, Y_{0i}) \perp Z_i \mid e_i$, which imply that the treatment may be viewed as being
randomly assigned to subjects with the same propensity score. Therefore, one can intuitively
think of the entire data set as a collection of many tiny randomized experiments, each
defined on a distinct value of the propensity score. An estimator for the causal treatment
effect may be formed by properly aggregating results from these tiny experiments.

Popular methods that use the propensity score to estimate the treatment effect include
stratification or regression [T51IT], matching [HUE], inverse probability weighting [13], or
a combination of them [11]. This paper studies a new approach, the method of matching
weights. Section 2 introduces the estimator and discusses its asymptotic properties,
estimand, computation, and balance diagnosis. It shares some features of both propensity
score matching and inverse probability weighting and avoids some of their drawbacks.
Section 3 develops an augmented matching weight estimator that is double robust and
efficient within a class of asymptotically linear estimators. The theoretical development
parallels that of the inverse probability weighting method [15]. Section 4 presents
numerical studies that compare the proposed estimators with competing estimators.



2 Matching Weight Estimator 



The propensity score is often estimated by a logistic regression of $Z_i$ on $X_i$:

$$
e_i = e(X_i, \beta) = \Pr(Z_i = 1 \mid X_i) = \frac{\exp(X_i^T \beta)}{1 + \exp(X_i^T \beta)}. \qquad (1)
$$

We call (1) the propensity score model. Throughout this paper, the term "propensity score"
refers to $e_i$ on its probability scale, i.e., $0 < e_i < 1$, unless otherwise specified.
The propensity score cannot be 0 or 1; otherwise the subject cannot potentially be assigned
to both treatments, and one of the potential outcomes is undefined.

We define the matching weight for subject $i$ as

$$
W_i = \frac{\min(1 - e_i, e_i)}{Z_i e_i + (1 - Z_i)(1 - e_i)}. \qquad (2)
$$

The matching weight estimator is

$$
\hat\Delta_{MW} = \frac{\sum_{i=1}^n W_i Z_i Y_i}{\sum_{i=1}^n W_i Z_i}
 - \frac{\sum_{i=1}^n W_i (1 - Z_i) Y_i}{\sum_{i=1}^n W_i (1 - Z_i)}. \qquad (3)
$$

The matching weight is a modification of the inverse probability weight with
$\min(1 - e_i, e_i)$ placed in the numerator, which prevents the weight from becoming
excessively large when $e_i$ approaches 0 or 1, stabilizes the estimator, and improves its
efficiency. Here is the intuition behind this estimator. Suppose we focus on a small stratum
of $m_0$ subjects with propensity scores very close to $e_0$. Then we would expect roughly
$m_0 e_0$ treated subjects and $m_0(1 - e_0)$ controls. When $e_0 < 0.5$, there are more
controls than treated subjects in this stratum, hence we give less weight to the controls in
order to achieve balance. When $e_0 > 0.5$, there are more treated subjects than controls,
and we give less weight to the treated subjects.
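
As a minimal sketch of the definitions above (not the authors' implementation), the
following Python code computes the matching weight of a subject and the weighted difference
in mean outcomes, assuming the propensity scores are already available. The function names
`matching_weight` and `mw_estimate` and the toy data are hypothetical and chosen only for
illustration.

```python
def matching_weight(e, z):
    """Matching weight min(1 - e, e) / [z*e + (1 - z)*(1 - e)] for one subject."""
    if not 0.0 < e < 1.0:
        raise ValueError("propensity score must lie strictly between 0 and 1")
    return min(1.0 - e, e) / (z * e + (1 - z) * (1.0 - e))


def mw_estimate(y, z, e):
    """Matching-weighted mean outcome of the treated minus that of the controls."""
    w = [matching_weight(ei, zi) for ei, zi in zip(e, z)]
    ess_treated = sum(wi * zi for wi, zi in zip(w, z))        # effective sample size
    ess_control = sum(wi * (1 - zi) for wi, zi in zip(w, z))  # effective sample size
    mu1 = sum(wi * zi * yi for wi, zi, yi in zip(w, z, y)) / ess_treated
    mu0 = sum(wi * (1 - zi) * yi for wi, zi, yi in zip(w, z, y)) / ess_control
    return mu1 - mu0


# Toy stratum (hypothetical data): propensity score 0.25, one treated, three controls.
y = [5.0, 2.0, 3.0, 4.0]
z = [1, 0, 0, 0]
e = [0.25, 0.25, 0.25, 0.25]

print([matching_weight(ei, zi) for ei, zi in zip(e, z)])
print(mw_estimate(y, z, e))
```

In this toy stratum the treated subject keeps weight 1 while each of the three controls is
down-weighted to one third, so both groups contribute an effective sample size of one, in
line with the intuition described above.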

Another modification is to give the treated subjects weight 1 and the controls weight
$e_i/(1 - e_i)$, leading to an estimator of the average treatment effect for the treated
[TO]. However, this weight may still be very large and unstable when $e_i$ is close to 1,
reflecting the difficulty of recovering information about $Y_{0i}$ when most likely we can
only observe $Y_{1i}$.

Since the matching weight is between 0 and 1, it can be viewed as a sampling probability:
the treated and control subjects are sampled with sampling probabilities that depend on the
propensity scores. Consequently, $\hat\Delta_{MW}$ is the difference in mean outcomes
between the sampled treated and control subjects. We define the effective sample size of the
sampled subjects in the treatment group as $\sum_{i=1}^n W_i Z_i$ and, for the controls,
$\sum_{i=1}^n W_i (1 - Z_i)$. They are asymptotically equal. The following result further
characterizes the populations sampled by the matching weights.

Proposition 1. Let $f(e)$ be the density function of the propensity score $e$, and let
$S_0(e)$ and $S_1(e)$ be the sampling probabilities for the subjects with $Z = 0$ and
$Z = 1$ such that both the expected effective sample sizes and the distributions of the
propensity score are asymptotically equal between the sampled subgroups with $Z = 0$ and
$Z = 1$. Then

$$
S_0(e) \le \min(1 - e, e)/(1 - e) \quad \text{and} \quad S_1(e) \le \min(1 - e, e)/e,
$$

and the equalities hold simultaneously.

This result shows that the matching weight is optimal in the sense that it maximizes the
sizes of the sampled subgroups while keeping them balanced in their effective sample sizes
and their distributions of the propensity score, and hence of $X$. The density function of
the propensity score is identical for the two subpopulations sampled by the matching
weights:

$$
f^*(e) = \frac{f(e) \min(1 - e, e)}{\int f(u) \min(1 - u, u)\,du}, \qquad 0 < e < 1.
$$

We call them maximal balanced subpopulations. The matching weight is very similar to
propensity score matching. First, they both produce weighted subgroups that have similar
distributions in the propensity score and confounders, and similar effective sample sizes.
Second, they both let each subject in the data be under-represented in the sense that each
is weighted by a number that is non-negative and no more than 1. One difference is that with
matching, each subject receives a weight of 1 (matched) or 0 (unmatched), while with
matching weights, all subjects are retained and we deal with their probabilities of being
matched instead of deciding who can be matched and who cannot. Another difference is that
the matching weight is calculated for each subject independently, but matching derives the
weights from a matching algorithm, which may introduce complicated dependence between
matched subjects. Alternative weighting methods, such as inverse probability weighting,
allow each subject in the data to be over-represented in the sense that their weights are
bigger than 1.

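Proposition 2. Assume that the propensity score model is known. When $n \to \infty$, we have

$$
\hat\Delta_{MW} \xrightarrow{p}
\frac{E\{\min(1 - e_i, e_i)\,\Delta(X_i)\}}{E\{\min(1 - e_i, e_i)\}} = \Delta_0
\quad \text{and} \quad
\sqrt{n}\,(\hat\Delta_{MW} - \Delta_0) \xrightarrow{d} N(0, V_{MW}),
$$

where

$$
V_{MW} = \frac{E\left\{ [\min(1 - e_i, e_i)(Y_{1i} - \mu_1)]^2 / e_i
 + [\min(1 - e_i, e_i)(Y_{0i} - \mu_0)]^2 / (1 - e_i) \right\}}
 {\left[ E\{\min(1 - e_i, e_i)\} \right]^2},
$$

with $\mu_1 = \int E(Y_{1i} \mid e_i) f^*(e_i)\,de_i$ and
$\mu_0 = \int E(Y_{0i} \mid e_i) f^*(e_i)\,de_i$.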

This result establishes the asymptotic distribution of the matching weight estimator. In the
special case $\Delta(X) = \Delta$, we have $\Delta_0 = \Delta$ and the matching weight
estimator consistently estimates the treatment effect. Under heterogeneous conditions, its
estimand is the average treatment effect over the maximal balanced subpopulations. In terms
of the estimand, the matching weight method is again similar to propensity score matching,
but with an advantage: the subpopulation on which the average treatment effect is defined is
unique and optimal in the sense of Proposition 1. In propensity score matching, often a
subset of the treated and control subjects is selected into the matched data, and the
distribution of the matched data depends on the matching algorithm and the caliper size [5].
It is unclear which subpopulations are being compared. Inverse probability weighting has the
marginal structural model interpretation [16], and it estimates the average causal effect
over the entire population under both homogeneous and heterogeneous conditions. However,
this goal is achieved at some cost. Although it is undesirable to include in the study
subjects that can almost only be assigned to the treatment or to the control, in practical
situations the propensity score is often unknown and must be calculated from a parsimonious
mathematical model. Sometimes the calculated propensity scores are very close to 0 or 1. In
such cases, the method has to use a few treated or control subjects to "recover" the virtual
populations in which all subjects received the control or the treatment. This is done by
weighting the data with the inverse of small probabilities. Data sparsity like this may
result in a large loss of efficiency and unstable calculation [13]. Under homogeneous
conditions, there is no need to estimate the marginal means of the potential outcomes in
order to obtain an estimator of the treatment effect. In this case, the estimands of the
matching weight and inverse probability weighting methods are identical.

We can estimate the matching weights jointly with the treatment effect by solving the
following estimating equations with respect to $\theta = (\mu_1, \mu_0, \beta^T)^T$:

$$
0 = \psi(\theta) = \sum_{i=1}^n \psi_i(\theta) = \sum_{i=1}^n
\begin{pmatrix}
W(X_i, Z_i, \beta)\, Z_i\, (Y_i - \mu_1) \\
W(X_i, Z_i, \beta)\, (1 - Z_i)\, (Y_i - \mu_0) \\
\dfrac{Z_i - e(X_i, \beta)}{e(X_i, \beta)[1 - e(X_i, \beta)]}\, e'_\beta(X_i)
\end{pmatrix},
\qquad (4)
$$

where $e'_\beta(X_i) = \partial e(X_i, \beta)/\partial\beta$ and we rewrite $W_i$ as
$W(X_i, Z_i, \beta)$. The matching weight estimator is
$\hat\Delta_{MW} = \hat\mu_1 - \hat\mu_0$, which converges to $\Delta_0$ asymptotically.
Similar to inverse probability weighting [H], this is a one-step approach that properly
accounts for the uncertainty in the propensity score model. The stratification and matching
methods used in practice are typically two-step approaches: the propensity score model is
fit in the first step, and the treatment effect is estimated in a second step without
adjusting for the uncertainty and correlation in the propensity scores estimated in the
first step.

The variance of the matching weight estimator is calculated by the sandwich method: the
covariance matrix of $\hat\theta$ is estimated by $n^{-1} A_n^{-1} B_n A_n^{-T}$, with
$A_n = n^{-1} \sum_{i=1}^n \partial\psi_i(\hat\theta)/\partial\theta^T$ and
$B_n = n^{-1} \sum_{i=1}^n \psi_i(\hat\theta)\psi_i(\hat\theta)^T$, where $\psi_i(\theta)$
is the $i$-th summand in (4), and the variance of $\hat\Delta_{MW} = \hat\mu_1 - \hat\mu_0$
follows from the corresponding linear contrast. One issue remains to be resolved. Since the
matching weight function does not have a continuous derivative at $e_i = 0.5$,
$W(X_i, Z_i, \beta)$ is not everywhere differentiable with respect to $\beta$. Since $W_i$
equals $\eta_1(e) = \min(1 - e, e)/e$ when $Z = 1$ and
$\eta_0(e) = \min(1 - e, e)/(1 - e)$ when $Z = 0$, we solve this problem by replacing the
middle piece of $\eta_1(e)$ and $\eta_0(e)$ around 0.5 with a cubic polynomial that connects
smoothly with the two ends. The result is an approximate matching weight function with a
continuous first derivative everywhere, which satisfies the usual regularity conditions for
sandwich variance estimation. Since the middle piece can be made arbitrarily small, the
approximation is quite accurate.

We first approximate $\eta_1(e)$. Let $\eta_1^*(e) = \eta_1(e)$ if
$e \in (0, 0.5 - \delta) \cup (0.5 + \delta, 1)$ and
$\eta_1^*(e) = a_0 + a_1 e + a_2 e^2 + a_3 e^3$ if $e \in [0.5 - \delta, 0.5 + \delta]$. In
order for $\eta_1^*(e)$ to have a continuous first derivative everywhere and adequately
approximate $\eta_1(e)$, $\{a_0, a_1, a_2, a_3\}$ must satisfy four conditions:
(1) $\eta_1^*(0.5 - \delta) = 1$; (2) $\eta_1^{*\prime}(0.5 - \delta) = 0$;
(3) $\eta_1^*(0.5 + \delta) = (1 - 2\delta)/(1 + 2\delta)$;
(4) $\eta_1^{*\prime}(0.5 + \delta) = -4/(1 + 2\delta)^2$. Here the notation $f'(\cdot)$
denotes the first derivative of the function. Solving these four equations for $a_0$ to
$a_3$, we have:

$$
(a_0, a_1, a_2, a_3)^T = D^{-1}
\left( 1,\; 0,\; \frac{1 - 2\delta}{1 + 2\delta},\; \frac{-4}{(1 + 2\delta)^2} \right)^T
$$

with

$$
D = \begin{pmatrix}
1 & 0.5 - \delta & (0.5 - \delta)^2 & (0.5 - \delta)^3 \\
0 & 1 & 2(0.5 - \delta) & 3(0.5 - \delta)^2 \\
1 & 0.5 + \delta & (0.5 + \delta)^2 & (0.5 + \delta)^3 \\
0 & 1 & 2(0.5 + \delta) & 3(0.5 + \delta)^2
\end{pmatrix}.
$$

Similarly, we can define $\eta_0^*(e) = \eta_0(e)$ if
$e \in (0, 0.5 - \delta) \cup (0.5 + \delta, 1)$ and
$\eta_0^*(e) = b_0 + b_1 e + b_2 e^2 + b_3 e^3$ if $e \in [0.5 - \delta, 0.5 + \delta]$.
Then $\eta_0^*(e)$ approximates $\eta_0(e)$ with

$$
(b_0, b_1, b_2, b_3)^T = D^{-1}
\left( \frac{1 - 2\delta}{1 + 2\delta},\; \frac{4}{(1 + 2\delta)^2},\; 1,\; 0 \right)^T.
$$

In all the numerical studies in this paper, we set $\delta = 0.002$.
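
The cubic smoothing step can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not the authors' code: the helper names
`solve_linear`, `smoothing_coefficients`, and `eta_smooth` are hypothetical, and the linear
system is solved by plain Gaussian elimination with partial pivoting so that the example
needs only the standard library.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def smoothing_coefficients(delta, treated=True):
    """Cubic coefficients for the middle piece of eta_1 (treated) or eta_0 (control)."""
    lo, hi = 0.5 - delta, 0.5 + delta
    d = [
        [1.0, lo, lo ** 2, lo ** 3],
        [0.0, 1.0, 2 * lo, 3 * lo ** 2],
        [1.0, hi, hi ** 2, hi ** 3],
        [0.0, 1.0, 2 * hi, 3 * hi ** 2],
    ]
    ratio = (1 - 2 * delta) / (1 + 2 * delta)
    slope = 4.0 / (1 + 2 * delta) ** 2
    rhs = [1.0, 0.0, ratio, -slope] if treated else [ratio, slope, 1.0, 0.0]
    return solve_linear(d, rhs)


def eta_smooth(e, z, delta=0.002):
    """Smoothed matching weight: exact outside [0.5 - delta, 0.5 + delta], cubic inside."""
    if abs(e - 0.5) > delta:
        return min(1.0 - e, e) / (e if z == 1 else 1.0 - e)
    c = smoothing_coefficients(delta, treated=(z == 1))
    return c[0] + c[1] * e + c[2] * e ** 2 + c[3] * e ** 3


print(smoothing_coefficients(0.002, treated=True))
print(eta_smooth(0.5, 1), eta_smooth(0.3, 1), eta_smooth(0.7, 1))
```

Because the two knots are only $2\delta$ apart, the matrix $D$ is poorly conditioned for
small $\delta$; a production implementation might prefer to centre the polynomial at 0.5,
which is a design choice outside the scope of the text.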

Balance diagnosis, i.e., checking whether $X_i \perp Z_i \mid e_i$ holds, is an attractive
feature of propensity score analysis in comparison with direct regression on the outcome
|HJ EE]. In the propensity score matching paradigm, the general recommendation is to
calculate the standardized difference, i.e., the absolute difference in means divided by a
pooled standard deviation [3]. If the standardized difference becomes small enough after
matching, that would suggest balance. However, there has been no widely accepted guidance on
how small is small enough. Another issue is that although the standardized difference is
very similar to the t-statistic, one cannot use it to test for balance, because the sample
size reduction after matching alone could reduce the significance of these tests. Exceptions
to this rule exist [S]. In the inverse probability weighting paradigm, confounders should be
balanced after weighting. However, when some propensity scores are close to 0 or 1, the
balance diagnosis statistics may be highly variable and balance is difficult to ascertain.

With the matching weights, we argue that balance can be checked with statistical tests. The 
presumption is that if the propensity score model is correct, then the confounders, weighted 
by the matching weights, should be balanced between the treated and control groups; on the 
other hand, if the propensity score model is misspecified, imbalance may show up in some 
confounders. This is like a check for propensity score model mis-specification and the null 
hypothesis is that the model is correctly specified. Let $X_i$ be a confounder whose balance
we want to examine and $g(X_i)$ a pre-defined function of $X_i$. The balance diagnostic
statistic is:

$$
B = \frac{\sum_{i=1}^n W_i Z_i\, g(X_i)}{\sum_{i=1}^n W_i Z_i}
 - \frac{\sum_{i=1}^n W_i (1 - Z_i)\, g(X_i)}{\sum_{i=1}^n W_i (1 - Z_i)}.
$$

If the propensity score model (1) is correctly specified, each of the two terms in this
expression converges to its mean with respect to the maximal balanced subpopulations,
denoted by $\mu_{B1}$ and $\mu_{B0}$, and they are identical. Hence, we would expect $B$ to
converge to 0 as $n \to \infty$. To study its variance, we can again formulate the
estimating equations as:

$$
0 = \sum_{i=1}^n
\begin{pmatrix}
W(X_i, Z_i, \beta)\, Z_i\, \{g(X_i) - \mu_{B1}\} \\
W(X_i, Z_i, \beta)\, (1 - Z_i)\, \{g(X_i) - \mu_{B0}\} \\
\dfrac{Z_i - e(X_i, \beta)}{e(X_i, \beta)[1 - e(X_i, \beta)]}\, e'_\beta(X_i)
\end{pmatrix},
$$

and the variance of $B$ follows from the sandwich method. We may set $g(x) = x$ to check for
imbalance in the mean, or set $g(x) = x^2$ to check for imbalance in the second moment, etc.
This estimating equation can take vector-valued $g(\cdot)$ and $X_i$ so that several
confounders can be considered jointly. We can also define
$g(X_{1i}, X_{2i}) = X_{1i} X_{2i}$ to study any imbalance in the correlation structure.
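
The point estimate of the balance statistic is straightforward to compute once the
propensity scores are available. The sketch below is an illustration only: the function
name `balance_statistic` and the toy data are hypothetical, and the sandwich variance needed
for a formal test is deliberately not implemented here.

```python
def balance_statistic(x, z, e, g=lambda v: v):
    """Difference in matching-weighted means of g(x) between treated and controls."""
    w = [min(1.0 - ei, ei) / (zi * ei + (1 - zi) * (1.0 - ei)) for ei, zi in zip(e, z)]
    sum_w1 = sum(wi * zi for wi, zi in zip(w, z))
    sum_w0 = sum(wi * (1 - zi) for wi, zi in zip(w, z))
    mean1 = sum(wi * zi * g(xi) for wi, zi, xi in zip(w, z, x)) / sum_w1
    mean0 = sum(wi * (1 - zi) * g(xi) for wi, zi, xi in zip(w, z, x)) / sum_w0
    return mean1 - mean0


# Toy data (hypothetical): all propensity scores equal 0.5, so every weight is 1.
x = [1.0, 2.0, 3.0, 4.0]
z = [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]

print(balance_statistic(x, z, e))                      # imbalance in the mean
print(balance_statistic(x, z, e, g=lambda v: v ** 2))  # imbalance in the second moment
```

Passing a different `g` reproduces the choices discussed above; a product of two
confounders could be handled by letting each element of `x` be a pair and `g` multiply its
components.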

One caveat with these tests is that, even when the propensity score model is correct, if
many tests are performed, some tests will be significant by chance. Therefore, one should 
be cautious when many tests are used simultaneously. At the very least, the significance 
thresholds of these tests offer the data analyst a rough benchmark on whether the imbalance 
is small enough. If many confounders turn out to be significant, it may be a sign that the 
propensity score model needs improvement. It may be possible, still within the M-estimation 
framework, to develop an overall test of imbalance with properly controlled type I error. That
is ongoing work and beyond the scope of this paper.

Nearly three decades after the propensity score was proposed, there is still "rampant lack
of good practice in propensity score matching applications" [9]. There are a number of
reasons. First, the asymptotic theory of propensity score matching is non-standard and very
complicated [1], making it difficult to study its properties or develop theoretically
justified guidelines. For example, matching depends on the caliper choice, but a
theoretically justified optimal caliper still needs to be developed [5]. Second, matching
may introduce correlation between subjects in the same matched pair, fitting a propensity
score model in the first step may introduce correlation between different matched pairs, and
complicated dependence may also arise if matching is done with replacement. These
correlations are often ignored or inadequately adjusted for during the analysis of the
matched data, and accurate variance estimation is often not available. General-purpose
variance estimation methods, such as the bootstrap, do not apply to matching estimators [2].
As a result, there remains debate on whether unpaired or paired analysis of the outcome is
more appropriate for matched data [3] [23].

The method of matching weights resembles propensity score matching, but its asymptotic
theory is much simpler. It is a one-step approach, so the propensity score and the treatment
effect can be estimated simultaneously from a set of estimating equations, and an accurate
analytical variance formula is available and justified. Choosing the matching algorithm or
caliper is somewhat subjective, and the practice varies among data analysts. This task is no
longer needed with the matching weight method.

Under homogeneous conditions, the matching weight estimator and the inverse probability
weighted estimator are similar when the propensity scores are not too close to 0 or 1. In
that case, the histogram of the propensity scores concentrates in the middle range of the
unit interval (Figure 1). When some propensity scores become extreme, the two estimators
begin to diverge and the matching weight estimator is more efficient. This is observed in
the simulation of Section 4. Under heterogeneous conditions, the two estimators have
different estimands and are not comparable. In such a case, an estimator obtained from one
study may not be generalizable to other study populations. What is more likely to be
generalizable is the result of a test of the null "the treatment effect is zero for all
subjects" versus the alternative "the treatment effect varies among subjects". Assuming that
the statistical power is always adequate and the treatment truly affects the outcome, the
treatment effect should show up, more or less, in different studies, regardless of the
population on which the average treatment effect is defined. The test can be performed with
the inverse probability weighting method, with the entire subject population as the target
population. It can also be performed with the matching weight method, with the maximal
balanced subpopulation as the target population. These tests will be studied in Section 4.
The discussion above assumes that under the alternative the treatment effect is either
positive and varies, or negative and varies, among subjects. If the treatment effect is
positive for some subjects and negative for others, then neither inverse probability
weighting nor matching weights can guarantee enough statistical power under the alternative.



3 Double Robust Matching Weights Estimator 

In this section we develop an augmented matching weights estimator that has the "double
robust" property. The idea follows from the inverse probability weighted double robust
estimator [HUH]. In addition to the propensity score model, the augmented estimator involves
two outcome models, one for the regression of $Y_i$ on $X_i$ among the treated subjects, and
one for the controls. Let $\alpha_1$ be the parameters associated with the outcome model for
the treatment group. We write $m_1(X_i, \alpha_1) = E(Y_i \mid X_i, Z_i = 1)$ for the
conditional expectation of the outcome given the covariates, and
$S_1(Y_i, X_i, \alpha_1)$ for the unbiased estimating equation for $\alpha_1$ derived from
the likelihood or quasi-likelihood of this outcome model. For the control group, we define
$m_0(X_i, \alpha_0)$ and $S_0(Y_i, X_i, \alpha_0)$ similarly.



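The augmented matching weights estimator is:

$$
\hat\Delta_{MW,DR} =
\frac{\sum_{i=1}^n W_i \{m_1(X_i, \alpha_1) - m_0(X_i, \alpha_0)\}}{\sum_{i=1}^n W_i}
 + \frac{\sum_{i=1}^n W_i Z_i \{Y_i - m_1(X_i, \alpha_1)\}}{\sum_{i=1}^n W_i Z_i}
 - \frac{\sum_{i=1}^n W_i (1 - Z_i) \{Y_i - m_0(X_i, \alpha_0)\}}{\sum_{i=1}^n W_i (1 - Z_i)}.
\qquad (5)
$$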



Proposition 3. The augmented matching weight estimator $\hat\Delta_{MW,DR}$ is consistent
for $\Delta_0$ as long as at least one of the following two models is correctly specified:
(1) the propensity score model (1); (2) the outcome models $m_1(X, \alpha_1)$ and
$m_0(X, \alpha_0)$.



Proposition 4. Assume that the propensity score model (1) is known. The class of influence
functions of regular asymptotically linear estimators for $\Delta_0$ is given by (subscript
$i$ suppressed)

$$
\left\{ \frac{\min(1 - e, e)}{E\{\min(1 - e, e)\}}
\left[ \frac{ZY}{e} - \frac{(1 - Z)Y}{1 - e} - \Delta_0 \right] + a : a \in \Lambda \right\},
$$

where $\Lambda$ is the space of functions of the form $\{Z - e\}h(X)$ for any function
$h(X)$. Among all estimators with influence functions in this class, the augmented matching
weight estimator is the most efficient in the sense that it has the smallest variance.



The proof is similar to §13.5 of [25]. The property described in Proposition 3 is called
double robustness. In usual statistical models, if a model is misspecified, the result is
usually biased. With double robustness, even if one part of the model fails, we may still
get an unbiased estimator from the other part of the model. Therefore, it gives the data
analyst two chances, instead of one, to get a correct result. Proposition 4 shows the
benefit of adding the outcome models: we will arrive at a more efficient estimator.

Double robustness has been established for the inverse probability weighting method [14],
but not for the other propensity score analysis methods. Ho et al. [11] and Stuart [23]
mentioned that regressing the outcome on propensity score matched data leads to double
robust estimation, but they did not give any theoretical justification of that claim.
Proposition 3 can be used to support that claim, given the similarity between the matching
weight method and matching.

Estimator (5) involves the unknown parameters $\alpha_1$, $\alpha_0$, and $\beta$. In
practice they can be replaced by consistent estimators and Proposition 3 still holds. Let
$\mu_1$, $\mu_2$, and $\mu_3$ be the asymptotic limits of the three terms in (5). The
estimating equations for $(\mu_1, \mu_2, \mu_3, \alpha_1^T, \alpha_0^T, \beta^T)^T$ are:

$$
0 = \sum_{i=1}^n
\begin{pmatrix}
W(X_i, Z_i, \beta)\, \{m_1(X_i, \alpha_1) - m_0(X_i, \alpha_0) - \mu_1\} \\
W(X_i, Z_i, \beta)\, Z_i\, \{Y_i - m_1(X_i, \alpha_1) - \mu_2\} \\
W(X_i, Z_i, \beta)\, (1 - Z_i)\, \{Y_i - m_0(X_i, \alpha_0) - \mu_3\} \\
S_1(Y_i, X_i, \alpha_1) \\
S_0(Y_i, X_i, \alpha_0) \\
\dfrac{Z_i - e(X_i, \beta)}{e(X_i, \beta)[1 - e(X_i, \beta)]}\, e'_\beta(X_i)
\end{pmatrix}.
$$

Solving these estimating equations jointly, we have
$\hat\Delta_{MW,DR} = \hat\mu_1 + \hat\mu_2 - \hat\mu_3$. The variance of
$\hat\Delta_{MW,DR}$ can be calculated by the sandwich method.
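
To make the three-term structure concrete, the following Python sketch evaluates the
augmented estimator from fitted quantities. It assumes that the propensity scores and the
outcome-model predictions have already been estimated elsewhere; the function name
`dr_mw_estimate`, its arguments, and the toy data are hypothetical, and no variance is
computed.

```python
def dr_mw_estimate(y, z, e, m1, m0):
    """Augmented matching weight estimate mu1 + mu2 - mu3.

    y, z, e : outcomes, treatment indicators (0/1), propensity scores
    m1, m0  : outcome-model predictions under treatment and under control
    """
    w = [min(1.0 - ei, ei) / (zi * ei + (1 - zi) * (1.0 - ei)) for ei, zi in zip(e, z)]
    sum_w = sum(w)
    sum_w1 = sum(wi * zi for wi, zi in zip(w, z))
    sum_w0 = sum(wi * (1 - zi) for wi, zi in zip(w, z))
    # Weighted mean of the model-based contrast m1 - m0 over all subjects.
    mu1 = sum(wi * (a - b) for wi, a, b in zip(w, m1, m0)) / sum_w
    # Weighted mean residual among the treated.
    mu2 = sum(wi * zi * (yi - a) for wi, zi, yi, a in zip(w, z, y, m1)) / sum_w1
    # Weighted mean residual among the controls.
    mu3 = sum(wi * (1 - zi) * (yi - b) for wi, zi, yi, b in zip(w, z, y, m0)) / sum_w0
    return mu1 + mu2 - mu3


# Toy data (hypothetical): one treated subject and three controls, all with e = 0.25.
y = [5.0, 2.0, 3.0, 4.0]
z = [1, 0, 0, 0]
e = [0.25] * 4

print(dr_mw_estimate(y, z, e, m1=[5.0] * 4, m0=[3.0] * 4))  # informative outcome models
print(dr_mw_estimate(y, z, e, m1=[0.0] * 4, m0=[0.0] * 4))  # uninformative outcome models
```

When both outcome predictions are set to zero, the first term vanishes and the expression
reduces to the unaugmented matching weight estimator, which illustrates how the augmentation
terms act as residual corrections.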



4 Numerical Studies 

We conducted simulations to study the numerical performance of the matching weight estimator
and the double robust matching weight estimator, and to compare them with three other types
of propensity score analysis methods: stratification, matching, and inverse probability
weighting. For stratification, we used five strata, as this is a popular choice in data
analytical practice [H HE]. For propensity score matching, we used the R package MatchIt
(http://gking.harvard.edu/matchit) and the optimal caliper size recommended by Austin [5],
which equals 0.2 times the standard deviation of the propensity score on its logit scale.
To study the sensitivity of matching results to caliper size, we also consider a smaller
caliper at 0.1 and a bigger caliper at 0.3. For inverse probability weighting (IPW), we
studied the double robust IPW estimator and IPW3 [H]. IPW3 is the inverse probability
weighting estimator with stabilized weights, which has improved efficiency and numerical
stability over simple IPW methods.

The propensity score model is a logistic regression
$\text{logit}\{\Pr(Z_i = 1 \mid X_i)\} = X_i^T \beta$. The outcome model is a linear
regression $Y_i = \Delta Z_i + X_i^T \alpha + \epsilon_i$ with $\Delta = 2$,
$\alpha = (1, 2, -1, -2, 1)^T$, and $\epsilon_i \sim N(0, 2^2)$. Here
$X_i = (X_{0i}, X_{1i}, X_{2i}, X_{3i}, X_{4i})^T$, where $X_{0i} = 1$, $X_{1i}$ and
$X_{2i} \sim N(0, 1)$, and $X_{3i}$ and $X_{4i} \sim 2 \times \text{Bernoulli}(0.5)$;
$X_{1i}$ through $X_{4i}$ are independent. We consider three scenarios:
(1) $\beta = (-1, 0.4, 0.2, 0.4, 0.2)^T$; (2) $\beta = (-2, 0.8, 0.4, 0.8, 0.4)^T$;
(3) $\beta = (-3, 1.5, 0.75, 1.5, 0.75)^T$. Under these scenarios, the proportion of
subjects with $Z = 1$ is between 35% and 40%, and $\text{var}(Y_i)/\text{var}(\epsilon_i)$
is between 3.5 and 3.8. The sample size is $n = 1000$, and each simulation is based on 1000
Monte Carlo replicates.
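
The data-generating mechanism described above can be sketched as follows. This is an
illustrative reconstruction from the stated model, not the authors' simulation code; the
names `simulate`, `ALPHA`, and `BETAS` are hypothetical, and the seed is arbitrary.

```python
import math
import random

ALPHA = (1.0, 2.0, -1.0, -2.0, 1.0)  # outcome-model coefficients stated in the text
BETAS = {                             # propensity-model coefficients for Scenarios 1-3
    1: (-1.0, 0.4, 0.2, 0.4, 0.2),
    2: (-2.0, 0.8, 0.4, 0.8, 0.4),
    3: (-3.0, 1.5, 0.75, 1.5, 0.75),
}


def simulate(n, beta, delta=2.0, rng=None):
    """Draw n subjects; each row is (x, z, y, e) with e the true propensity score."""
    rng = rng or random.Random()
    rows = []
    for _ in range(n):
        x = (
            1.0,                                 # intercept X0
            rng.gauss(0.0, 1.0),                 # X1 ~ N(0, 1)
            rng.gauss(0.0, 1.0),                 # X2 ~ N(0, 1)
            2.0 if rng.random() < 0.5 else 0.0,  # X3 ~ 2 x Bernoulli(0.5)
            2.0 if rng.random() < 0.5 else 0.0,  # X4 ~ 2 x Bernoulli(0.5)
        )
        linear = sum(b * v for b, v in zip(beta, x))
        e = 1.0 / (1.0 + math.exp(-linear))      # logistic propensity score
        z = 1 if rng.random() < e else 0
        y = delta * z + sum(a * v for a, v in zip(ALPHA, x)) + rng.gauss(0.0, 2.0)
        rows.append((x, z, y, e))
    return rows


rows = simulate(1000, BETAS[1], rng=random.Random(2024))
print(sum(z for _, z, _, _ in rows) / len(rows))  # share of treated subjects
```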

The mirror histograms in Figure 1 illustrate the difference among the three scenarios. Each
mirror histogram consists of four histograms. The two outer ones show the distributions of
the propensity scores of the treated (below) and control (above) subjects. Nested within
them are two histograms in color, for which each subject is weighted by $W_i$, the matching
weight. Since the matching weights result in maximal balanced subgroups, the two nested
histograms are like reflections in a mirror. Propensity score matching can also be
illustrated with a mirror histogram: the nested histograms are created by giving weight 1 to
matched subjects and 0 to unmatched subjects. The plot is similar to Figure 1 for the
optimal caliper size and is omitted. The mirror histogram can also be used to compare
continuous confounders before and after applying the matching weights. Scenarios 1-3
represent increasing imbalance between the treated and control groups, and an increasing
number of subjects with propensity scores close to 0 or 1.
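
The four sets of bar heights behind a mirror histogram can be computed as in the sketch
below. Plotting is omitted; one would draw the control bars upward and the treated bars
downward with any plotting tool. The function name `mirror_histogram_counts`, the
equal-width binning on the unit interval, and the toy data are assumptions made for
illustration.

```python
def mirror_histogram_counts(e, z, n_bins=10):
    """Per-bin raw counts and matching-weighted counts of propensity scores by group."""
    counts = {
        "control_raw": [0.0] * n_bins,
        "control_weighted": [0.0] * n_bins,
        "treated_raw": [0.0] * n_bins,
        "treated_weighted": [0.0] * n_bins,
    }
    for ei, zi in zip(e, z):
        b = min(int(ei * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        w = min(1.0 - ei, ei) / (zi * ei + (1 - zi) * (1.0 - ei))
        group = "treated" if zi == 1 else "control"
        counts[group + "_raw"][b] += 1.0       # outer histogram
        counts[group + "_weighted"][b] += w    # nested (weighted) histogram
    return counts


# Toy data (hypothetical): one treated subject and three controls, all with e = 0.25.
counts = mirror_histogram_counts([0.25] * 4, [1, 0, 0, 0], n_bins=4)
for name, heights in counts.items():
    print(name, heights)
```

In the toy example the raw counts differ between the groups while the weighted heights
coincide, which is the mirror-image pattern described above.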

Table 1 compares 13 point estimators. They are from: (1) true outcome regression; (2)
stratification by five strata; (3) matching with caliper 0.1; (4) matching with caliper 0.2, the 
optimal caliper; (5) matching with caliper 0.3; (6) IPW3; (7) double robust IPW estimator; 
(8) matching weight (MW) estimator; (9) MW estimator with incorrect propensity score 
model; (10) double robust MW estimator; (11) double robust MW estimator with incorrect 
propensity score model; (12) double robust MW estimator with incorrect outcome model; 
(13) double robust MW estimator with incorrect propensity score model and outcome model. 
The incorrect propensity score model is the logistic regression with only $X_1$ and $X_2$ as
covariates. The incorrect outcome model is the linear regression of $Y$ with $X_1$ and $X_3$ as
covariates. Hence, they represent situations where some confounders are ignored.

From Table [T] we have the following observations. First, although the matching weight 
method resembles matching, it is more efficient and has less bias than than the matching 
estimator, though its effective sample size is smaller than the sample size of the matched 
data set. If there are several subjects within the caliper of the propensity score, the matching 
algorithm chooses the one with the closest propensity score and may discard others. This is 
like giving weight 1 to the matched subject and giving no weight to others. The matching 
weight method retains every subject in the neighborhood, and giving them roughly equal 
weights so that they all contribute to averaging the outcome. This increases numerical stabil- 
ity and efficiency, and reduces bias. Second, the results from IPW methods are comparable 
with the MW methods in Scenario 1, when the treated and control groups have nearly bal- 
anced propensity score distributions. This is also the situation where the propensity score 
can not be too close to or 1. However, with moderate (Scenario 2) and severe (Scenario 
3) imbalance, the IPW methods may suffer from large loss of efficiency. In fact, the match- 
ing weight estimator without augmentation is even more efficient than the double robust 
IPW estimator in these scenarios. Third, the simulation results support the double robust 
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and efficiency properties stated in Propositions 3 and 4. Interestingly, even when both the 
propensity score model and the outcome models are incorrect, the double robust matching 
weight estimator is still better than the matching weight estimator with incorrect propensity 
score model and without augmentation. Therefore, it seems that the double robust match- 
ing weight estimator should always be recommended in data analytical practice. Finally, 
the stratification method clearly has larger bias than other methods, and is not efficient 
in general. This observation agrees with that in [14J. The stratification does not produce 
consistent estimators, and there is still a lack of guidance on whether the number of strata 
should increase with the sample size. Table 2 shows the empirical coverage probabilities of 
the 95% confidence intervals for the various matching weight estimators. The results indicate 
that the sandwich variance formula is accurate whenever the estimator itself is consistent 
(Methods 8 and 10-12). 
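
The contrast drawn above between matching weights and IPW weights can be made concrete with a small sketch. It assumes the matching weight takes the form min(e, 1 - e) divided by the probability of the treatment actually received, and it uses a Kish-type effective sample size, (sum of w)^2 / (sum of w^2); both formulas are assumptions stated for illustration and should be checked against the definitions given earlier in the paper. The function names and the propensity scores are hypothetical.

```python
def matching_weight(e, z):
    """Assumed matching weight: min(e, 1 - e) over the probability of the arm received."""
    received = e if z == 1 else 1.0 - e
    return min(e, 1.0 - e) / received


def ipw_weight(e, z):
    """Inverse probability weight: 1 over the probability of the arm received."""
    received = e if z == 1 else 1.0 - e
    return 1.0 / received


def effective_sample_size(weights):
    """Kish-type effective sample size (an assumption): (sum w)^2 / (sum w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Hypothetical propensity scores for five treated subjects (z = 1); one is close to 0.
scores = [0.02, 0.3, 0.5, 0.7, 0.9]
mw = [matching_weight(e, 1) for e in scores]
ipw = [ipw_weight(e, 1) for e in scores]

print([round(w, 3) for w in mw])
print([round(w, 3) for w in ipw])
print(round(effective_sample_size(mw), 2), round(effective_sample_size(ipw), 2))
```

In this toy example every matching weight stays between 0 and 1, whereas the subject with propensity score 0.02 receives an IPW weight of about 50 and dominates the weighted average, which is the kind of instability the text attributes to IPW under imbalance.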

We also conducted a simulation under heterogeneous treatment effects, based on the 
discussion at the end of §2. We modify the outcome model to 
$Y_i = \Delta_i Z_i + X_i^T \alpha + \epsilon_i$, with 
$\Delta_i = \theta(2.5 + 0.5X_{1i} - 0.5X_{3i})$. This setup ensures that the treatment has a 
positive effect on almost every subject (except < 0.1% of the cases), but the magnitude of 
the effect varies. The parameter $\theta$ controls the overall size of the effect. When 
$\theta = 0$, the treatment has no effect on the outcome. Table 3 presents the rejection 
probabilities of the two-sided 0.05-level test of the null hypothesis $\theta = 0$. The test 
statistic is a Wald-type statistic: the matching weight estimator divided by its sandwich 
standard error. The type I error is correct even when the sample size is as low as 200. In 
comparison, the double robust IPW estimator has inflated type I error and lower statistical 
power at these sample sizes. The rejection probability under the alternative likely depends 
on how the treatment effect varies across subjects, and hence it is difficult to study 
theoretically or to draw generalizable conclusions. 
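
The Wald-type test described above can be sketched as follows. This is a minimal illustration, assuming the statistic is referred to a standard normal distribution; the function names `wald_test` and `rejection_rate` are hypothetical, and the inputs stand in for an estimate and its sandwich standard error, which are not computed here.

```python
import math


def wald_test(estimate, std_error, alpha=0.05):
    """Two-sided Wald test of a zero effect using a normal reference distribution."""
    z = estimate / std_error
    # Two-sided p-value: P(|N(0, 1)| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value, p_value < alpha


def rejection_rate(estimates, std_errors, alpha=0.05):
    """Percentage of simulated data sets in which the null hypothesis is rejected."""
    rejected = [wald_test(est, se, alpha)[2] for est, se in zip(estimates, std_errors)]
    return 100.0 * sum(rejected) / len(rejected)


# Hypothetical estimates and standard errors from four simulated data sets.
print(rejection_rate([2.0, 1.0, -3.0, 0.5], [1.0, 1.0, 1.0, 1.0]))
```

Under the null hypothesis this percentage estimates the type I error; under an alternative it estimates the power.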
                      Scenario 1                  Scenario 2                  Scenario 3
Method          bias    var    MSE    ESS   bias    var    MSE    ESS   bias    var    MSE    ESS
1:best           0.1    100    100           0.7    100    100           0.1    100    100
2:strt           2.6    115    131           4.9    185    227           6.4    546    612
3:M 0.1          0.7    183    184    726    1.1    217    217    538   -1.3    285    287    390
4:M opt          0.6    170    170    758    1.3    191    193    571   -3.2    244    261    429
5:M 0.3          0.4    164    164    784    1.5    197    199    603   -5.7    231    284    469
6:IPW3           0.1    110    110           1.0    199    200           4.0    512    538
7:DR IPW         0.0    103    103           0.4    146    145           0.1    494    494
8:MW             0.1    106    106    714    0.6    115    114    520    0.1    130    130    366
9:MW p         -29.8    225   2254         -55.0    197   5723         -87.0    157  12652
10:DR MW         0.1    102    102           0.6    105    105           0.2    113    113
11:DR MW p       0.0    101    101           0.6    106    106           0.3    106    106
12:DR MW y       0.1    104    104           0.6    108    108           0.2    116    116
13:DR MW py      9.4    126    327          17.9    127    708          25.7    130   1217

Table 1: Comparison of estimators on bias, variance, mean squared error (MSE) and effective 
sample size (ESS). Bias is expressed as % difference from the true value $\Delta = 2$. Variance 
and MSE are expressed as a percentage of those of Method 1. 
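
The summary measures reported in Table 1 can be computed from Monte Carlo replicates along the following lines. This is a sketch under stated assumptions: the helper names `summarize` and `relative_to` are hypothetical, the variance is taken over replicates with a population (divide-by-n) convention so that MSE equals variance plus squared bias, and the replicate values are made up for illustration.

```python
from statistics import mean, pvariance


def summarize(estimates, truth):
    """Bias (as % of the true value), variance, and MSE over simulation replicates."""
    bias = mean(estimates) - truth
    var = pvariance(estimates)
    return {"bias_pct": 100.0 * bias / truth, "var": var, "mse": var + bias ** 2}


def relative_to(summary, reference):
    """Variance and MSE expressed as a percentage of a reference method's values."""
    return {
        "var_pct": 100.0 * summary["var"] / reference["var"],
        "mse_pct": 100.0 * summary["mse"] / reference["mse"],
    }


# Hypothetical replicate estimates for a reference method and a competitor.
truth = 2.0
reference = summarize([1.9, 2.1, 2.0, 2.0], truth)
competitor = summarize([1.8, 2.2, 2.1, 2.1], truth)

print(round(competitor["bias_pct"], 2))
print({k: round(v, 1) for k, v in relative_to(competitor, reference).items()})
```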



                    Scenario
Method            1       2       3
8:MW           94.2    93.9    95.0
9:MW p         16.0     0.1     0.0
10:DR MW       94.1    94.4    94.8
11:DR MW p     94.3    94.7    95.1
12:DR MW y     94.2    94.3    94.4
13:DR MW py    74.3    48.2    24.8



Table 2: Coverage probabilities (%) of matching weight estimator 
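
An empirical coverage probability of the kind reported in Table 2 is the percentage of simulated data sets whose confidence interval contains the true value. The sketch below assumes a normal-approximation 95% interval, estimate ± 1.96 × standard error; the function name `coverage_probability` and the inputs are hypothetical.

```python
def coverage_probability(estimates, std_errors, truth, z_crit=1.96):
    """Percentage of intervals estimate +/- z_crit * se that contain the true value."""
    covered = [abs(est - truth) <= z_crit * se for est, se in zip(estimates, std_errors)]
    return 100.0 * sum(covered) / len(covered)


# Hypothetical estimates and standard errors from four simulated data sets.
print(coverage_probability([2.0, 2.5, 1.0, 2.1], [0.1, 0.2, 0.3, 0.1], truth=2.0))
```

Coverage far below the nominal level, as for a biased estimator, does not by itself imply a poor variance formula, since the interval is then centered in the wrong place.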
               8:MW              10:DR MW            7:DR IPW
θ        n = 200  n = 600   n = 200  n = 600   n = 200  n = 600
0            4.7      4.7       5.2      5.1       6.8      7.5
0.25        24.6     65.4      30.9     67.3      28.2     61.4
0.50        75.2     99.9      79.5     99.8      75.3     98.4

Table 3: Rejection probabilities (%) under heterogeneous conditions (Scenario 2) 



[Figure 1 image: three mirror-histogram panels titled Scenario 1, Scenario 2, and Scenario 3; horizontal axis "Propensity Score", with ticks from 0.0 to 1.0.]



Figure 1: Mirror histograms illustrating the propensity scores and matching weights for the 
three simulation scenarios. Below horizontal zero line: Z = 1; above: Z = 0. 